############################ Before Starting ############################
# It is adviseble using the nbextension Table of Contents (2) to better #
# navigate throuh the notebook with maximum level of nested sections to #
# display on the tables of contents of 5 #
#########################################################################
Author: Alessandro Arnone
Akadelivers es una empresa de reparto a domicilio especializada en la entrega de paquetes en menos de 1 hora, lo que se denomina (Q-commerce = Quick commerce) Esta empresa tiene una aplicación móvil con la que sus usuarios pueden elegir entre un catálogo de productos de tiendas locales de su ciudad y que les sean entregados en menos de 10 minutos a la dirección que deseen.
Cuando un usuario pide un pedido a través de Akadelivers se le cobra directamente el coste total (coste del producto + gastos de servicio + gastos de envío). Una vez el usuario ha pagado un producto, el repartidor que se encuentre más próximo a la tienda que tiene el producto se acerca a esta, paga el producto, lo recoge y lo lleva a la dirección que el usuario ha elegido. Akadelivers se lo llevara a la dirección indicada.

order_id: Número de identificación del pedido.
local_time: Hora local a la que se realiza el pedido.
country_code: Código del pais en el que se realiza el pedido.
store_address: Número de tienda en a la que se realiza el pedido.
payment_status: Estado del pedido.
n_of_products: Número de productos que se han comprado en ese pedido.
products_total: Cantidad en Euros que el usuario ha comprado en la app.
final_status: Estado final del pedido (este será la variable 'target' a predecir) que indicara si el pedido será finalmente entregado o cancelado. Hay dos tipos de estado:
In this section I want to give immediatly and insights about the results and the methodology used.
In order to execute the assignement the following assumption have been taken: -there is no 'minimum order' for the transaction (Products_total variable) -Cancelled payment does not mean 'cancelled delivered' ( Status payment variable)
A variable selection has been made based on the relationship between the prediction hace towards the final status. Results are collected at each section
import pandas as pd
from datetime import datetime
#statis
from scipy.stats import chi2_contingency
#sklearn
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold, cross_val_predict
from sklearn.model_selection import KFold, train_test_split,GridSearchCV,StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier,AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, balanced_accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, RandomizedSearchCV
# feat importance
import dalex as dx
#imblearn
from imblearn.pipeline import Pipeline as imbpipeline
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE, ADASYN
#plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)
pd.options.display.float_format = '{:.3f}'.format
#formatting
from IPython.display import display
pd.options.display.float_format = '{:,.2f}'.format
URL_TRAIN='https://challenges-asset-files.s3.us-east-2.amazonaws.com/data_sets/Data-Science/4+-+events/jump2digital/dataset/train.csv'
URL_TEST='https://challenges-asset-files.s3.us-east-2.amazonaws.com/data_sets/Data-Science/4+-+events/jump2digital/dataset/test_X.csv'
trainSet=pd.read_csv(URL_TRAIN)
Before to proceed to the count of different order, we need to verify if in the 54330 observations, there are no duplicates
trainSet.nunique()
order_id 54330 local_time 32905 country_code 23 store_address 5627 payment_status 3 n_of_products 27 products_total 3904 final_status 2 dtype: int64
Once verified we can proceed to the count of the order_id grouped by country
top3Country=trainSet.groupby('country_code')['order_id'].count().sort_values(ascending=False)[0:3].reset_index().rename(columns = {'order_id':'total_orders'})
The top 3 countries by order are: Argentina, Spain and Turkey
top3Country
| country_code | total_orders | |
|---|---|---|
| 0 | AR | 11854 |
| 1 | ES | 11554 |
| 2 | TR | 5696 |
trainSet['local_time'] = pd.to_datetime(trainSet['local_time'])
trainSet['hour']= pd.to_datetime(trainSet['local_time'], format='%H:%M:%S').dt.hour
ordersByHour=trainSet[trainSet['country_code']=='ES'].groupby('hour')['order_id'].count().sort_values(ascending=False)
hourSpain=trainSet[trainSet['country_code']=='ES']['hour']
plt.hist(hourSpain,bins=24)
plt.xlabel('Hour')
plt.ylabel('Count of Orders')
plt.title('Histogram of Order by hour')
plt.xlim(0, 24)
plt.grid(True)
plt.show()
The busiest hours are around dinner time which goes around 19-20-21 with the timespam from 20:00:00 to 20:59:59 being the busiest
ordersByHour
averageShop12513_complete=trainSet[trainSet['store_address']==12513]['products_total'].mean()
averageShop12513_onlyDelivered=trainSet[(trainSet['store_address']==12513) & (trainSet['final_status']=='DeliveredStatus')]['products_total'].mean()
print('The average price for order of ID 12513 is:',round(averageShop12513_complete,2), '[Considering all the orders]')
print('The average price for order of ID 12513 is:',round(averageShop12513_onlyDelivered,2), '[Considering only Completed orders]')
The average price for order of ID 12513 is: 17.39 [Considering all the orders] The average price for order of ID 12513 is: 17.38 [Considering only Completed orders]
Teniendo en cuenta los picos de demanda en España, si los repartidores trabajan en turnos de 8horas.
Qué porcentaje de repartidores pondrías por cada turno para que sean capaces de hacer frente a los picos de demanda. (ej: Turno 1 el 30%, Turno 2 el 10% y Turno 3 el 60%).
bins = [0, 7, 15, 24]
# add custom labels if desired
labels = ['00:00-07:59', '08:00-15:59', '16:00-23:59']
# add the bins to the dataframe
trainSet['time_bin'] = pd.cut(trainSet['hour'], bins, labels=labels, right=False)
orderByBinnedHour=trainSet[(trainSet['final_status']=='DeliveredStatus') & (trainSet['country_code']=='ES')].groupby('time_bin')['order_id'].count().rename("percentage").transform(lambda x: (x/x.sum()))
orderByBinnedHour
time_bin 00:00-07:59 0.00 08:00-15:59 0.34 16:00-23:59 0.66 Name: percentage, dtype: float64
trainSet['final_status_binary']=pd.get_dummies(trainSet['final_status'], drop_first=True)
trainSet['products_total'].hist()
<AxesSubplot:>
ax=sns.countplot(x="payment_status", hue="final_status", data=trainSet)
total = len(trainSet)
for p in ax.patches:
percentage = f'{100 * p.get_height() / total:.1f}%\n'
x = p.get_x() + p.get_width() / 2
y = p.get_height()
ax.annotate(percentage, (x, y), ha='center', va='center')
plt.tight_layout()
plt.show()
trainSet.groupby(['payment_status','final_status'])['order_id'].count().rename("percentage").transform(lambda x: x/x.sum())
payment_status final_status
DELAYED CanceledStatus 0.00
DeliveredStatus 0.00
NOT_PAID CanceledStatus 0.00
DeliveredStatus 0.01
PAID CanceledStatus 0.11
DeliveredStatus 0.89
Name: percentage, dtype: float64
Based on that, It looks that when the status of a transaction is NOT_PAID the probability of being cancelled increase. Hence It will be included in our model
To assess this information a chi-square test will be performed ( categorical vs categorical). If the H(0)=Indipendency cannot be rejected, the two variables will be considered dependent hence a variability in the final status can be expalined by the variability in the variable payment_statys
chi2, p, dof, expected = chi2_contingency((pd.crosstab(trainSet.payment_status, trainSet.final_status).values))
print (f'Chi-square Statistic : {chi2} ,p-value: {p}')
Chi-square Statistic : 102.42785141954717 ,p-value: 5.728945194717638e-23
We can reject the null hypothesis and conclude there is a relationship between payment_status and final_status
fig, (ax1, ax2) = plt.subplots(1,2)
ax1.boxplot(trainSet['n_of_products'],0)
ax1.set_ylabel("Number of product")
ax1.set_title("Product number - whisker plot", fontsize=12)
ax2.set_ylabel("Frequency")
ax2.set_xlabel("Quantity of products")
ax2.hist(trainSet['n_of_products'], color='darkslateblue',bins=40, alpha=0.9)
ax2.set_title(" Product number distribution of all transactions", fontsize=12)
plt.show()
-Outliers check:
- Median centered around 2
- 62% of the transaction has 2 or 1 product
- 89 % of the transaction has less or equal to 5 products
-
- Distribution positevely skewed
- Extreme values are expected according the right skewed distribution hence will not be removed
groupinProduct=trainSet.groupby('n_of_products').count().reset_index()
totalTransaction=pd.DataFrame()
totalTransaction['n_of_products']=groupinProduct['n_of_products']
totalTransaction['totalTransaction']=groupinProduct['final_status_binary']
groupinProduct=trainSet.groupby('n_of_products').sum().reset_index()
totalTransaction['numberDelivered']=groupinProduct['final_status_binary']
totalTransaction['percentageDelivered']=round(totalTransaction['numberDelivered']/totalTransaction['totalTransaction'],2)
fig = plt.figure()
ax = plt.axes()
ax.plot(totalTransaction['n_of_products'], totalTransaction['percentageDelivered'])
plt.title('Percentage of Delivered by Product number')
Text(0.5, 1.0, 'Percentage of Delivered by Product number')
sns.countplot(trainSet['n_of_products'], hue=trainSet['final_status_binary'])
/Users/alessandro/opt/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. warnings.warn(
<AxesSubplot:xlabel='n_of_products', ylabel='count'>
from scipy import stats# point-biserial correlation
# output is a tuple
result = stats.pointbiserialr(trainSet['n_of_products'], trainSet['final_status_binary'])
print(f'correlation between X and y: {result[0]:.2f}')
print(f'p-value: {result[1]:.2g}')
correlation between X and y: 0.02 p-value: 2.3e-05
Percentage of delivered is constant until 13 products ( which account the 99%+ of the data) hence It does not look that product number can be used in our model since it does not explain any variability of our target variable. Moreover as can be noticed by the Point biserial test - used for testing the dependency between number of products and our target variable we have a 0 correlation with an realiable p-value
ax1=sns.distplot(trainSet[trainSet['final_status']=='DeliveredStatus']['hour'], label='Delivered')
ax2=sns.distplot(trainSet[trainSet['final_status']=='CanceledStatus']['hour'], label='Cancelled')
plt.xlabel('Probability by Status', fontsize=15)
plt.ylabel('Probability Density', fontsize=15)
plt.legend(fontsize=15)
plt.show()
/Users/alessandro/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning) /Users/alessandro/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
bin_orders = [0, 6, 12, 18,24]
labels = ['Night_orders', 'Morning_orders', 'Afternoon_orders','Evening_orders']
trainSet['bin_orders'] = pd.cut(trainSet['hour'], bin_orders, labels=labels, right=False)
groupinProduct=trainSet.groupby('bin_orders').count().reset_index()
totalTransaction=pd.DataFrame()
totalTransaction['bin_orders']=groupinProduct['bin_orders']
totalTransaction['totalTransaction']=groupinProduct['final_status_binary']
groupinProduct=trainSet.groupby('bin_orders').sum().reset_index()
totalTransaction['numberDelivered']=groupinProduct['final_status_binary']
totalTransaction['percentageDelivered']=round(totalTransaction['numberDelivered']/totalTransaction['totalTransaction'],2)
totalTransaction
| bin_orders | totalTransaction | numberDelivered | percentageDelivered | |
|---|---|---|---|---|
| 0 | Night_orders | 631 | 424.00 | 0.67 |
| 1 | Morning_orders | 6671 | 6,035.00 | 0.90 |
| 2 | Afternoon_orders | 21338 | 19,236.00 | 0.90 |
| 3 | Evening_orders | 25690 | 22,803.00 | 0.89 |
We can notice above that there is a different trend between the night orders (defines as order from minight until 6am) and the rest: in fact the probability that those are cancelled it's much higher. Based on this we will include Hour of delivery inside our model
trainSet['products_total'].describe()
count 54,330.00 mean 9.84 std 9.26 min 0.00 25% 4.13 50% 7.13 75% 12.77 max 221.48 Name: products_total, dtype: float64
fig, (ax1, ax2) = plt.subplots(1,2)
ax1.boxplot(trainSet['products_total'],0)
ax1.set_ylabel("Number of product")
ax1.set_title("Product number - whisker plot", fontsize=12)
ax2.set_ylabel("Frequency")
ax2.set_xlabel("Amount of transactions")
ax2.hist(trainSet['products_total'], color='darkslateblue',bins=500, alpha=0.9)
ax2.set_title(" Product number distribution of all transactions", fontsize=12)
plt.show()
fig, (ax1, ax2) = plt.subplots(1,2)
fig.suptitle('Focus on products_total < 1')
ax1.boxplot(trainSet[trainSet['products_total']<1]['products_total'],0)
ax1.set_ylabel("Number of product")
ax1.set_title("Product number - whisker plot", fontsize=12)
ax2.set_ylabel("Frequency")
ax2.set_xlabel("Quantity")
ax2.hist(trainSet[trainSet['products_total']<1]['products_total'], alpha=0.9)
ax2.set_title(" Product number distribution of all transactions", fontsize=12)
plt.show()
The Probability Density plot is almost overallping amongst the 2 class of our dependent variable except for the tail
ax1=sns.distplot(trainSet[trainSet['final_status']=='DeliveredStatus']['products_total'], bins=200,label='Delivered')
ax2=sns.distplot(trainSet[trainSet['final_status']=='CanceledStatus']['products_total'],bins=200, label='Cancelled')
plt.xlabel('Density Plot for Product Price', fontsize=15)
plt.ylabel('Probability Density', fontsize=15)
plt.legend(fontsize=15)
plt.show()
/Users/alessandro/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning) /Users/alessandro/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
ax1=sns.distplot(trainSet[(trainSet['final_status']=='DeliveredStatus') & (trainSet['products_total']>50)]['products_total'], bins=20 ,label='Delivered')
ax2=sns.distplot(trainSet[(trainSet['final_status']=='CanceledStatus') & (trainSet['products_total']>50)]['products_total'], bins=20,label='Cancelled')
plt.xlabel('Density Plot for Product Price', fontsize=15)
plt.ylabel('Probability Density', fontsize=15)
plt.legend(fontsize=15)
plt.show()
/Users/alessandro/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning) /Users/alessandro/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
bin_products = [0, 50,100, 260]
labels = ['Total_Amount< 60','60 < Total_Amount < 100', 'Total_Amount>100']
trainSet['bin_products'] = pd.cut(trainSet['products_total'], bin_products, labels=labels, right=False)
groupinProduct=trainSet.groupby('bin_products').count().reset_index()
totalTransaction=pd.DataFrame()
totalTransaction['bin_products']=groupinProduct['bin_products']
totalTransaction['totalTransaction']=groupinProduct['final_status_binary']
groupinProduct=trainSet.groupby('bin_products').sum().reset_index()
totalTransaction['numberDelivered']=groupinProduct['final_status_binary']
totalTransaction['percentageDelivered']=round(totalTransaction['numberDelivered']/totalTransaction['totalTransaction'],2)
totalTransaction
| bin_products | totalTransaction | numberDelivered | percentageDelivered | |
|---|---|---|---|---|
| 0 | Total_Amount< 60 | 53987 | 48,232.00 | 0.89 |
| 1 | 60 < Total_Amount < 100 | 321 | 253.00 | 0.79 |
| 2 | Total_Amount>100 | 22 | 13.00 | 0.59 |
result = stats.pointbiserialr(trainSet['products_total'], trainSet['final_status_binary'])
print('Product_total CONTINOUS:')
print(f'correlation between X and y: {result[0]:.2f}')
print(f'p-value: {result[1]:.2g}')
chi2, p, dof, expected = chi2_contingency((pd.crosstab(trainSet.products_total, trainSet.final_status).values))
print('\nProduct_total BINNED:')
print (f'Chi-square Statistic : {chi2} ,p-value: {p}')
Product_total CONTINOUS: correlation between X and y: -0.02 p-value: 3.1e-06 Product_total BINNED: Chi-square Statistic : 4806.578714548748 ,p-value: 7.333811402755825e-22
It might seems that the bigger is the total amount of the transaction, less is the probability to have our order completed. In fact it's much more likely that an order is delivered if the total amount is < 60 compared if the total amount is > 100. Unfortutanetly the quanitity of transactions that follow the latest pattern described does not represent a numerous sample inside our database.
sns.pointplot(x='country_code',y='products_total', hue='final_status_binary',data=trainSet)
<AxesSubplot:xlabel='country_code', ylabel='products_total'>
sns.countplot(x='country_code',hue='final_status_binary',data=trainSet, order = trainSet['country_code'].value_counts().index)
plt.title("Count of Cancelled vs delivered transaction")
Text(0.5, 1.0, 'Count of Cancelled vs delivered transaction')
sns.boxplot(x='country_code',y='products_total',hue='final_status_binary', data=trainSet)
plt.title("Age by Passenger Class, Titanic")
Text(0.5, 1.0, 'Age by Passenger Class, Titanic')
groupinProduct=trainSet.groupby('country_code').count().reset_index()
totalTransaction=pd.DataFrame()
totalTransaction['country_code']=groupinProduct['country_code']
totalTransaction['totalTransaction']=groupinProduct['final_status_binary']
groupinProduct=trainSet.groupby('country_code').sum().reset_index()
totalTransaction['numberDelivered']=groupinProduct['final_status_binary']
totalTransaction['percentageDelivered']=round(totalTransaction['numberDelivered']/totalTransaction['totalTransaction'],2)
totalTransaction.sort_values(by='percentageDelivered',ascending=False)
| country_code | totalTransaction | numberDelivered | percentageDelivered | |
|---|---|---|---|---|
| 4 | CR | 1000 | 926.00 | 0.93 |
| 11 | GT | 511 | 468.00 | 0.92 |
| 9 | FR | 1911 | 1,754.00 | 0.92 |
| 19 | RO | 1957 | 1,799.00 | 0.92 |
| 13 | KE | 84 | 77.00 | 0.92 |
| 8 | ES | 11554 | 10,634.00 | 0.92 |
| 16 | PE | 4284 | 3,923.00 | 0.92 |
| 5 | DO | 448 | 409.00 | 0.91 |
| 20 | TR | 5696 | 5,180.00 | 0.91 |
| 6 | EC | 2265 | 2,031.00 | 0.90 |
| 12 | IT | 2537 | 2,276.00 | 0.90 |
| 21 | UA | 3729 | 3,330.00 | 0.89 |
| 15 | PA | 909 | 806.00 | 0.89 |
| 7 | EG | 1643 | 1,447.00 | 0.88 |
| 17 | PR | 29 | 25.00 | 0.86 |
| 10 | GE | 485 | 415.00 | 0.86 |
| 3 | CL | 994 | 857.00 | 0.86 |
| 0 | AR | 11854 | 10,107.00 | 0.85 |
| 14 | MA | 1446 | 1,222.00 | 0.85 |
| 18 | PT | 818 | 684.00 | 0.84 |
| 22 | UY | 169 | 125.00 | 0.74 |
| 2 | CI | 6 | 3.00 | 0.50 |
| 1 | BR | 1 | 0.00 | 0.00 |
Most numerous transaction have different probability of being marked as delivered. Example:
Both together account around the 40% of the database hence they can potentially contribute to explain the variability of our depdentent variable.
groupinProduct=trainSet.groupby('store_address').count().reset_index()
totalTransaction=pd.DataFrame()
totalTransaction['store_address']=groupinProduct['store_address']
totalTransaction['totalTransaction']=groupinProduct['final_status_binary']
groupinProduct=trainSet.groupby('store_address').sum().reset_index()
totalTransaction['numberDelivered']=groupinProduct['final_status_binary']
totalTransaction['percentageDelivered']=round(totalTransaction['numberDelivered']/totalTransaction['totalTransaction'],2)
totalTransaction.sort_values(by='numberDelivered', ascending=False)
| store_address | totalTransaction | numberDelivered | percentageDelivered | |
|---|---|---|---|---|
| 1350 | 28671 | 455 | 433.00 | 0.95 |
| 685 | 12513 | 245 | 239.00 | 0.98 |
| 775 | 14455 | 227 | 215.00 | 0.95 |
| 1356 | 28712 | 221 | 209.00 | 0.95 |
| 1351 | 28675 | 228 | 199.00 | 0.87 |
| ... | ... | ... | ... | ... |
| 3870 | 62760 | 2 | 0.00 | 0.00 |
| 2674 | 51139 | 1 | 0.00 | 0.00 |
| 3864 | 62698 | 1 | 0.00 | 0.00 |
| 4780 | 68973 | 1 | 0.00 | 0.00 |
| 3137 | 56000 | 4 | 0.00 | 0.00 |
5627 rows × 4 columns
test=totalTransaction
plt.scatter(test['totalTransaction'],test['percentageDelivered'])
<matplotlib.collections.PathCollection at 0x123db49d0>
test=test[test['totalTransaction']<60]
plt.scatter(test['totalTransaction'],test['percentageDelivered'])
<matplotlib.collections.PathCollection at 0x124e23760>
The higher the number of transactions by shop, the higher the probability that the order is gonna be delivered - trend valid for shop whose number of transaction is higher of 60
fig, (ax1, ax2) = plt.subplots(1,2)
ax1.boxplot(test['numberDelivered'],0)
ax1.set_ylabel("Number of product")
ax1.set_title("Product number - whisker plot", fontsize=12)
ax2.set_ylabel("Frequency")
ax2.set_xlabel("Amount of transactions")
ax2.hist(test['numberDelivered'], color='darkslateblue',bins=500, alpha=0.9)
ax2.set_title(" Product number distribution of all transactions", fontsize=12)
plt.show()
trainSet['highTransaction']=0
store_address_small=totalTransaction[(totalTransaction['totalTransaction']<25) ]['store_address']
store_address_small
0 190
1 191
2 193
3 194
4 196
...
5622 74863
5623 74871
5624 74873
5625 74889
5626 75236
Name: store_address, Length: 5097, dtype: int64
index_store_address_small=trainSet[trainSet.store_address.isin(store_address_small)].index
# set individual value once more
trainSet.loc[index_store_address_small, 'highTransaction'] = 1
sns.countplot(x='highTransaction',hue='final_status',data=trainSet)
plt.title("Count of Cancelled vs delivered transaction")
Text(0.5, 1.0, 'Count of Cancelled vs delivered transaction')
X=trainSet.copy()
y=trainSet['final_status_binary']
country_binned=pd.get_dummies(trainSet['country_code'])
payment_binned=pd.get_dummies(trainSet['payment_status'])
X=pd.concat([X, country_binned], axis=1)
X=pd.concat([X, payment_binned], axis=1)
X.columns
Index(['order_id', 'local_time', 'country_code', 'store_address',
'payment_status', 'n_of_products', 'products_total', 'final_status',
'hour', 'time_bin', 'final_status_binary', 'bin_orders', 'bin_products',
'highTransaction', 'AR', 'BR', 'CI', 'CL', 'CR', 'DO', 'EC', 'EG', 'ES',
'FR', 'GE', 'GT', 'IT', 'KE', 'MA', 'PA', 'PE', 'PR', 'PT', 'RO', 'TR',
'UA', 'UY', 'DELAYED', 'NOT_PAID', 'PAID'],
dtype='object')
X.drop(['order_id', 'local_time', 'country_code',
'payment_status', 'n_of_products', 'final_status','time_bin', 'final_status_binary', 'bin_orders',
'bin_products','store_address'], axis=1, inplace=True)
X.columns
Index(['products_total', 'hour', 'highTransaction', 'AR', 'BR', 'CI', 'CL',
'CR', 'DO', 'EC', 'EG', 'ES', 'FR', 'GE', 'GT', 'IT', 'KE', 'MA', 'PA',
'PE', 'PR', 'PT', 'RO', 'TR', 'UA', 'UY', 'DELAYED', 'NOT_PAID',
'PAID'],
dtype='object')
clf = RandomForestClassifier()
clf.fit(X, y)
DT_Dummy = dx.Explainer(clf, X, y,
label = "Dummy Model - Random Forest")
mp_rf = DT_Dummy.model_parts()
mp_rf.result
mp_rf.plot()
Preparation of a new explainer is initiated -> data : 54330 rows 29 cols -> target variable : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray. -> target variable : 54330 values -> model_class : sklearn.ensemble._forest.RandomForestClassifier (default) -> label : Dummy Model - Random Forest -> predict function : <function yhat_proba_default at 0x121251670> will be used (default) -> predict function : Accepts pandas.DataFrame and numpy.ndarray. -> predicted values : min = 0.0, mean = 0.892, max = 1.0 -> model type : classification will be used (default) -> residual function : difference between y and yhat (default) -> residuals : min = -0.949, mean = 0.000287, max = 0.745 -> model_info : package sklearn A new explainer has been created!
np.random.seed(42)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, stratify=y, random_state = 42)
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=1, random_state=42)
(43464, 29) (10866, 29) (43464,) (10866,)
# define model
model = RandomForestClassifier()
# evaluate evaluate
scores = cross_val_score(model, X_train, Y_train, scoring='f1', cv=cv, n_jobs=-1)
preds = cross_val_predict(model, X_test, Y_test)
print(scores)
print(classification_report(Y_test, preds))
[0.91165629 0.91479934 0.91734075 0.91627612 0.9118657 0.91244591
0.91363983 0.91592808 0.9154002 0.91472081]
precision recall f1-score support
0 0.20 0.12 0.15 1166
1 0.90 0.94 0.92 9700
accuracy 0.85 10866
macro avg 0.55 0.53 0.53 10866
weighted avg 0.82 0.85 0.84 10866
pipeline = imbpipeline(steps = [['smote', SMOTE(random_state=11)],
['classifier', LogisticRegression(random_state=11,
max_iter=1000)]])
param_grid = {'classifier__C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_search = GridSearchCV(estimator=pipeline,
param_grid=param_grid,
scoring='f1',
cv=cv,
n_jobs=-1)
grid_search.fit(X_train, Y_train)
cv_score = grid_search.best_score_
test_score = grid_search.score(X_test, Y_test)
print(f'Cross-validation score: {cv_score}\nTest score: {test_score}')
print(classification_report(Y_test, grid_search.predict(X_test)))
Cross-validation score: 0.8268647968793952
Test score: 0.827284946236559
precision recall f1-score support
0 0.15 0.34 0.20 1166
1 0.91 0.76 0.83 9700
accuracy 0.72 10866
macro avg 0.53 0.55 0.52 10866
weighted avg 0.82 0.72 0.76 10866
pipeline = imbpipeline(steps = [['smote', SMOTE(random_state=11)],
['classifier', RandomForestClassifier(verbose=2)]])
# Number of trees in random forest
n_estimators = [10,20]
# Maximum number of levels in tree
max_depth = [10,20]
# Create the random grid
random_grid = {'classifier__n_estimators': n_estimators,
'classifier__max_depth': max_depth
}
grid_search = RandomizedSearchCV(estimator=pipeline,
param_distributions = random_grid,
scoring='f1',
cv=cv,
n_jobs=-1)
grid_search.fit(X_train, Y_train)
cv_score = grid_search.best_score_
test_score = grid_search.score(X_test, Y_test)
print(f'Cross-validation score: {cv_score}\nTest score: {test_score}')
print(classification_report(Y_test, grid_search.predict(X_test)))
/Users/alessandro/opt/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_search.py:285: UserWarning: The total space of parameters 4 is smaller than n_iter=10. Running 4 iterations. For exhaustive searches, use GridSearchCV.
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) <ipython-input-100-f83cc96d2ecd> in <module> 19 n_jobs=-1) 20 ---> 21 grid_search.fit(X_train, Y_train) 22 cv_score = grid_search.best_score_ 23 test_score = grid_search.score(X_test, Y_test) ~/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs) 61 extra_args = len(args) - len(all_args) 62 if extra_args <= 0: ---> 63 return f(*args, **kwargs) 64 65 # extra_args > 0 ~/opt/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params) 839 return results 840 --> 841 self._run_search(evaluate_candidates) 842 843 # multimetric is determined here because in the case of a callable ~/opt/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates) 1617 def _run_search(self, evaluate_candidates): 1618 """Search n_iter candidates from param_distributions""" -> 1619 evaluate_candidates(ParameterSampler( 1620 self.param_distributions, self.n_iter, 1621 random_state=self.random_state)) ~/opt/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params, cv, more_results) 793 n_splits, n_candidates, n_candidates * n_splits)) 794 --> 795 out = parallel(delayed(_fit_and_score)(clone(base_estimator), 796 X, y, 797 train=train, test=test, ~/opt/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable) 1052 1053 with self._backend.retrieval_context(): -> 1054 self.retrieve() 1055 # Make sure that we get a last message telling us we are done 1056 elapsed_time = time.time() - self._start_time ~/opt/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self) 931 try: 932 if getattr(self._backend, 'supports_timeout', False): --> 933 self._output.extend(job.get(timeout=self.timeout)) 934 else: 935 self._output.extend(job.get()) ~/opt/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout) 540 AsyncResults.get from multiprocessing.""" 541 try: --> 542 return future.result(timeout=timeout) 543 except CfTimeoutError as e: 544 raise TimeoutError from e ~/opt/anaconda3/lib/python3.8/concurrent/futures/_base.py in result(self, timeout) 432 return self.__get_result() 433 --> 434 self._condition.wait(timeout) 435 436 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]: ~/opt/anaconda3/lib/python3.8/threading.py in wait(self, timeout) 300 try: # restore state no matter what (e.g., KeyboardInterrupt) 301 if timeout is None: --> 302 waiter.acquire() 303 gotit = True 304 else: KeyboardInterrupt:
pipeline = imbpipeline(steps = [['smote', SMOTE()],
['under', RandomUnderSampler()],
['classifier', RandomForestClassifier()]])
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1000, num = 5)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 50, num = 11)]
# Create the random grid
random_grid = {'classifier__n_estimators': [1,2],
'classifier__max_depth': max_depth,
'classifier__max_features':max_features
}
grid_search = RandomizedSearchCV(estimator=pipeline,
param_distributions = random_grid,
scoring='f1',
cv=cv,
n_jobs=-1)
grid_search.fit(X_train,Y_train)
cv_score = grid_search.best_score_
test_score = grid_search.score(X_test, Y_test)
print(f'Cross-validation score: {cv_score}\n Test score: {test_score}')
print(classification_report(Y_test, grid_search.predict(X_test)))
pipeline = imbpipeline(steps = [['smote', SMOTE()],
['under', RandomUnderSampler()],
['classifier',AdaBoostClassifier()]])
n_estimators = [500 ]
# Number of features to consider at every split
# Maximum number of levels in tree
# Create the random grid
random_grid = {'classifier__n_estimators': n_estimators,
}
grid_search = GridSearchCV(estimator=pipeline,
param_grid= random_grid,
scoring='f1',
cv=cv,
n_jobs=-1, verbose=3)
grid_search.fit(X_train, Y_train)
cv_score = grid_search.best_score_
test_score = grid_search.score(X_test, Y_test)
print(f'Cross-validation score: {cv_score}\n Test score: {test_score}')
print(classification_report(Y_test, grid_search.predict(X_test)))
pipeline = imbpipeline(steps = [['smote', SMOTE()],
['under', RandomUnderSampler()],
['classifier',GradientBoostingClassifier()]])
# Number of trees in random forest
n_estimators = [100]
# Create the random grid
fixed_grid = {'classifier__n_estimators': n_estimators,
}
grid_search = GridSearchCV(estimator=pipeline,
param_grid= fixed_grid,
scoring='f1',
cv=cv,
n_jobs=-1)
grid_search.fit(X_train, Y_train)
cv_score = grid_search.best_score_
test_score = grid_search.score(X_test, Y_test)
print(f'Cross-validation score: {cv_score}\n Test score: {test_score}')
print(classification_report(Y_test, grid_search.predict(X_test)))
testSet=pd.read_csv(URL_TEST, sep=';')
--------------------------------------------------------------------------- HTTPError Traceback (most recent call last) <ipython-input-149-e05296b2fe11> in <module> ----> 1 testSet=pd.read_csv(URL_TEST, sep=';') ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options) 608 kwds.update(kwds_defaults) 609 --> 610 return _read(filepath_or_buffer, kwds) 611 612 ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds) 460 461 # Create the parser. --> 462 parser = TextFileReader(filepath_or_buffer, **kwds) 463 464 if chunksize or iterator: ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds) 817 self.options["has_index_names"] = kwds["has_index_names"] 818 --> 819 self._engine = self._make_engine(self.engine) 820 821 def close(self): ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in _make_engine(self, engine) 1048 ) 1049 # error: Too many arguments for "ParserBase" -> 1050 return mapping[engine](self.f, **self.options) # type: ignore[call-arg] 1051 1052 def _failover_to_python(self): ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds) 1865 1866 # open handles -> 1867 self._open_handles(src, kwds) 1868 assert self.handles is not None 1869 for key in ("storage_options", "encoding", "memory_map", "compression"): ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in _open_handles(self, src, kwds) 1360 Let the readers open IOHanldes after they are done with their potential raises. 1361 """ -> 1362 self.handles = get_handle( 1363 src, 1364 "r", ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options) 556 557 # open URLs --> 558 ioargs = _get_filepath_or_buffer( 559 path_or_buf, 560 encoding=encoding, ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/common.py in _get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options) 287 "storage_options passed with file object or non-fsspec file path" 288 ) --> 289 req = urlopen(filepath_or_buffer) 290 content_encoding = req.headers.get("Content-Encoding", None) 291 if content_encoding == "gzip": ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/common.py in urlopen(*args, **kwargs) 193 import urllib.request 194 --> 195 return urllib.request.urlopen(*args, **kwargs) 196 197 ~/opt/anaconda3/lib/python3.8/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context) 220 else: 221 opener = _opener --> 222 return opener.open(url, data, timeout) 223 224 def install_opener(opener): ~/opt/anaconda3/lib/python3.8/urllib/request.py in open(self, fullurl, data, timeout) 529 for processor in self.process_response.get(protocol, []): 530 meth = getattr(processor, meth_name) --> 531 response = meth(req, response) 532 533 return response ~/opt/anaconda3/lib/python3.8/urllib/request.py in http_response(self, request, response) 638 # request was successfully received, understood, and accepted. 639 if not (200 <= code < 300): --> 640 response = self.parent.error( 641 'http', request, response, code, msg, hdrs) 642 ~/opt/anaconda3/lib/python3.8/urllib/request.py in error(self, proto, *args) 567 if http_err: 568 args = (dict, 'default', 'http_error_default') + orig_args --> 569 return self._call_chain(*args) 570 571 # XXX probably also want an abstract factory that knows when it makes ~/opt/anaconda3/lib/python3.8/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args) 500 for handler in handlers: 501 func = getattr(handler, meth_name) --> 502 result = func(*args) 503 if result is not None: 504 return result ~/opt/anaconda3/lib/python3.8/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs) 647 class HTTPDefaultErrorHandler(BaseHandler): 648 def http_error_default(self, req, fp, code, msg, hdrs): --> 649 raise HTTPError(req.full_url, code, msg, hdrs, fp) 650 651 class HTTPRedirectHandler(BaseHandler): HTTPError: HTTP Error 403: Forbidden
testSet
testSet['local_time'] = pd.to_datetime(testSet['local_time'])
testSet['hour']= pd.to_datetime(testSet['local_time'], format='%H:%M:%S').dt.hour
country_binned=pd.get_dummies(testSet['country_code'])
payment_binned=pd.get_dummies(testSet['payment_status'])
testSet=pd.concat([testSet, country_binned], axis=1)
testSet=pd.concat([testSet, payment_binned], axis=1)
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 3079 try: -> 3080 return self._engine.get_loc(casted_key) 3081 except KeyError as err: pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'local_time' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) <ipython-input-142-37f72523b232> in <module> ----> 1 testSet['local_time'] = pd.to_datetime(testSet['local_time']) 2 testSet['hour']= pd.to_datetime(testSet['local_time'], format='%H:%M:%S').dt.hour 3 4 country_binned=pd.get_dummies(testSet['country_code']) 5 payment_binned=pd.get_dummies(testSet['payment_status']) ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key) 3022 if self.columns.nlevels > 1: 3023 return self._getitem_multilevel(key) -> 3024 indexer = self.columns.get_loc(key) 3025 if is_integer(indexer): 3026 indexer = [indexer] ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 3080 return self._engine.get_loc(casted_key) 3081 except KeyError as err: -> 3082 raise KeyError(key) from err 3083 3084 if tolerance is not None: KeyError: 'local_time'
testSet.drop(['order_id', 'local_time', 'country_code',
'payment_status', 'n_of_products','store_address'], axis=1, inplace=True)
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-143-2df882039d89> in <module> ----> 1 testSet.drop(['order_id', 'local_time', 'country_code', 2 'payment_status', 'n_of_products','store_address'], axis=1, inplace=True) 3 ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in drop(self, labels, axis, index, columns, level, inplace, errors) 4306 weight 1.0 0.8 4307 """ -> 4308 return super().drop( 4309 labels=labels, 4310 axis=axis, ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in drop(self, labels, axis, index, columns, level, inplace, errors) 4151 for axis, labels in axes.items(): 4152 if labels is not None: -> 4153 obj = obj._drop_axis(labels, axis, level=level, errors=errors) 4154 4155 if inplace: ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in _drop_axis(self, labels, axis, level, errors) 4186 new_axis = axis.drop(labels, level=level, errors=errors) 4187 else: -> 4188 new_axis = axis.drop(labels, errors=errors) 4189 result = self.reindex(**{axis_name: new_axis}) 4190 ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in drop(self, labels, errors) 5589 if mask.any(): 5590 if errors != "ignore": -> 5591 raise KeyError(f"{labels[mask]} not found in axis") 5592 indexer = indexer[~mask] 5593 return self.delete(indexer) KeyError: "['order_id' 'local_time' 'country_code' 'payment_status' 'n_of_products'\n 'store_address'] not found in axis"
pred = grid_search.predict(X_test)
testSet['final_status']=pd.DataFrame(pred)
testSet
| products_total | hour | AR | DO | EC | EG | ES | FR | IT | MA | PA | PE | TR | UA | NOT_PAID | PAID | prediction | final_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 61.63 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 1 | 15.99 | 18 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 5.89 | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 3 | 7.85 | 22 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 4.75 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 5 | 14.28 | 11 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 6 | 14.35 | 18 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 7 | 4.42 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 8 | 1.79 | 18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 9 | 29.89 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 10 | 1.47 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 11 | 4.20 | 21 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 12 | 18.73 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| 13 | 3.80 | 20 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 14 | 6.25 | 14 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 15 | 14.50 | 21 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 16 | 21.10 | 18 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 17 | 9.85 | 20 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 18 | 0.77 | 13 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 19 | 2.89 | 22 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 20 | 4.30 | 16 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 21 | 26.22 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 22 | 4.44 | 14 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 23 | 3.35 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 24 | 7.57 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| 25 | 0.89 | 21 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 26 | 20.40 | 16 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 27 | 4.95 | 22 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 28 | 7.58 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 29 | 53.00 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
testSet['prediction'].to_csv('prediction.csv')